The relative solvent accessibility (RSA) of a residue in a protein
measures the extent of burial or exposure of that residue in the 
3D structure. RSA is frequently used to describe a protein's bio- 
physical or evolutionary properties. To calculate RSA, a residue's 
solvent accessibility (SA) needs to be normalized by a suitable 
reference value for the given amino acid; several normalization 
scales have previously been proposed. However, these scales do 
not provide tight upper bounds on SA values frequently observed 
in empirical crystal structures. Instead, they underestimate the 
largest allowed SA values, by up to 20%. As a result, many em- 
pirical crystal structures contain residues that seem to have RSA 
values in excess of one. Here, we derive a new normalization scale 
that does provide a tight upper bound on observed SA values. We 
pursue two complementary strategies, one based on extensive 
analysis of empirical structures and one based on systematic enu- 
meration of biophysically allowed tripeptides. Both approaches 
yield highly congruent results that consistently exceed published 
values. We conclude that previously published SA normalization 
values were too small primarily because the conformations that 
maximize SA had not been correctly identified. As an application 
of our results, we show that empirically derived hydrophobicity 
scales are sensitive to accurate RSA calculation, and we derive 
new hydrophobicity scales that correlate well with experimentally 
measured scales. 



Significance Statement 

Many applications in biology and biochemistry require an estimate of 
the extent to which amino acids are buried within a protein structure 
or exposed to solvent. A commonly used measure for this purpose, 
relative solvent accessibility (RSA), relies on 30-year-old, widely
used normalization constants. We demonstrate here that these constants
are too small, for some amino acids by up to 20%. All published
results using RSA are therefore potentially inaccurate. We calculate
a new set of normalization constants that alleviates this flaw.
To demonstrate an application, we derive improved amino-acid hy- 
drophobicity scales from the solvent accessibility of amino acids in 
protein structures. 



Introduction 

Relative solvent accessibility (RSA) has emerged as a commonly
used metric describing protein structure in computational molecular 
biology, with the particular application of identifying buried or ex- 
posed residues. It is defined as a residue's solvent accessibility (SA) 
normalized by a suitable maximum value for that residue. RSA was 
first introduced in the context of hydrophobicity scales derived by 
computational means from protein crystal structures [1, 2, 3, 4, 5]. 
More recently, RSA has been shown to correlate with protein evolu- 
tionary rates and has been incorporated as a parameter into models 
which determine these rates [6, 7, 8, 9, 10, 11, 12, 13]. As RSA 
straightforwardly characterizes the local environment of residues in 
protein structures, many studies have developed computational meth- 
ods to predict RSA from protein primary and/or secondary structure 
[14, 15, 16, 17]. Further applications of RSA include identification
of surface, interior, and interface regions in proteins [18],
protein-domain prediction [19], and prediction of deleterious
mutations [20].

To derive a residue's RSA from its surface area, an SA normaliza- 
tion factor is needed for each amino acid. By convention, these nor- 
malization values have been derived by evaluating the surface area 
around a residue of interest X when placed between two glycines, 
to form a Gly-X-Gly tripeptide. Most commonly, the normaliza- 
tion values utilized are those previously calculated by either Rose 
et al. [2] or Miller et al. [3]. The primary distinction between these
two sets of normalization values lies in the different φ and ψ
dihedral backbone angles chosen when evaluating Gly-X-Gly tripeptide
conformations. Rose et al. [2] considered tripeptides with backbone
angles representing an average of observed φ and ψ angles, whereas
Miller et al. [3] considered tripeptides in the extended conformation
(φ = -120°, ψ = 140°).

As the number of empirically determined 3D protein crystal 
structures has grown over the years, it has become apparent that nei- 
ther the Rose [2] nor the Miller [3] scale accurately identifies the 
true upper bound for a residue's SA. In fact, virtually all amino acids 
display, on occasion, SA values in excess of the normalization SA 
values provided by either scale. Some do so quite frequently (e.g. 
R, D, G, K, P), reaching RSA values of up to 1.2. This discrepancy, 
which leads to RSA values > 1, is generally known in the field though
rarely acknowledged in print. 

Here, we derive a new set of SA normalization values that pro- 
vide a tight upper bound on SA values observed in biophysically 
realistic tripeptide conformations. To calculate these normalization 
values, we pursue two complementary strategies — one empirical and 
one theoretical. For the empirical approach, we mined thousands 
of 3D crystal structures and recorded the maximum SA values we 
found for each amino acid across all structures. For the theoreti- 
cal approach, we computationally built Gly-X-Gly tripeptides and 
systematically evaluated all biophysically allowed conformations to 
determine a maximum theoretical SA value. These two strategies 
yield highly congruent results and ultimately produce comparable 
normalization scales that tightly bound SA for all 20 amino acids. 
We then return to the historic motivation for RSA and investigate the 
implications of our results for hydrophobicity scales. We find that 
SA normalization affects the performance of empirically derived hy- 
drophobicity scales, and we propose new scales that show improved 
correlation with experimentally measured scales. 
Results

Published SA normalization values are too small. We initially as- 
sessed the accuracy of Rose's [2] and Miller's [3] SA normalization 
scales through an exhaustive survey of the SA values found in exper- 
imentally determined protein structures. We obtained a list of 2908 
high-quality PDB structures from the PISCES server [21]. We then 
calculated SA for each residue in all 2908 structures, excluding any 
chain-terminating residues. SA values were subsequently normalized 
using the scales of either Rose et al. [2] or Miller et al. [3] to obtain 
RSA. For either scale and each amino acid, we found that residues 
with RSA > 1 were not uncommon (Figure 1); RSA values exceeded 
unity by up to 20%. The amino acids that most commonly displayed 
RSA > 1 were R, D, G, K, and P. For those amino acids, RSA values > 1
occurred at frequencies of 1% to 3% of all residues, depending on the 
normalization scale used (Figure 1). 

To determine the underlying factors leading to RSA > 1, we 
examined the association between RSA and the following factors: 
residue neighbors, secondary structure, bond lengths, bond angles, 
and dihedral angles. For most of these quantities, we found no strong 
association with RSA. We did, however, find a clear association with 
residues' φ and ψ backbone angles. For example, consider the
Ramachandran plot of alanine (Figure 2). A noticeable cluster of
high-RSA residues falls into the α-helix region of φ ≈ -50°, ψ ≈ -45°.
We found similar results for all other amino acids. Importantly,
neither Rose nor Miller derived their normalization SA values in that
region of backbone angles. Therefore, we concluded that previous
SA normalization scales were obtained with poorly chosen φ and ψ
angles.



Modeling Tripeptides Yields Significantly Higher Maximum SA 
Values. To derive maximum SA values for each amino acid X, we 
computationally constructed Gly-X-Gly tripeptides and systemati- 
cally rotated them through all biophysically allowed conformations 
(see Supporting Text for details). When constructing the tripeptides,
we set bond lengths and angles (excluding the ω, φ, ψ, and χ angles) for
each amino acid equal to the average values observed for that amino
acid in our reference set of 2908 PDB structures. We set ω = 180°.
We then rotated the φ and ψ angles around the X residue in discrete 1°
steps, exhaustively enumerating all conformations except those
nonexistent or rare in actual crystal structures. Additionally, we
iterated through all rotamer angles χ that were sterically possible
with each (φ, ψ) combination. For those amino acids with more than 10
possible distinct rotamer conformations, as determined by the Dunbrack
database [21], we evaluated ten randomly chosen rotamer conformations.
We recorded the maximum SA observed for each (φ, ψ) backbone-angle
combination.
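
The enumeration described above can be outlined in Python. This is only an illustrative sketch and not the program used in this work: `compute_sa` is a hypothetical callback that stands in for building the Gly-X-Gly tripeptide and measuring the SA of X, `is_allowed` is a hypothetical filter for conformations that are nonexistent or rare in crystal structures, and the rotamer list is assumed to be supplied by the caller. For simplicity the sketch draws one random rotamer subset per amino acid rather than per backbone conformation.

```python
import random


def max_sa_per_backbone(compute_sa, rotamers, is_allowed,
                        step=1, max_rotamers=10, seed=0):
    """Return {(phi, psi): maximum SA} over a regular grid of backbone angles.

    compute_sa(phi, psi, chi) -> float   hypothetical SA evaluator
    rotamers                             non-empty list of chi-angle tuples;
                                         use [()] if there are no chi angles
    is_allowed(phi, psi) -> bool         hypothetical conformation filter
    """
    rng = random.Random(seed)
    # Evaluate at most `max_rotamers` randomly chosen rotamers.
    if len(rotamers) > max_rotamers:
        rotamers = rng.sample(rotamers, max_rotamers)

    best = {}
    for phi in range(-180, 180, step):
        for psi in range(-180, 180, step):
            if not is_allowed(phi, psi):
                continue  # skip nonexistent or rare conformations
            best[(phi, psi)] = max(compute_sa(phi, psi, chi)
                                   for chi in rotamers)
    return best
```

With `step=1` the loop visits 360 × 360 backbone combinations, so in practice the cost is dominated by whatever `compute_sa` does for each conformation.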

Next, we compared the resulting theoretical maximum SA val- 
ues to the empirically observed maximum SA values. We binned 
both the theoretical and the empirical values into discrete 5° × 5° bins
of (φ, ψ) and recorded the maximum SA in each bin. To eliminate
nonexistent or rare conformations, we discarded for both data sets all
bins that contained fewer than five observations in the empirical data
set. We displayed the remaining data in side-by-side Ramachandran
plots (Figure 3) and generally found good congruence between the
theoretical and the empirical values for all amino acids. Regions that
had the highest maximum SA in the theoretical data set also had the
highest maximum SA in the empirical data set. The highest SA values
were generally observed in the α-helix region of the Ramachandran
plot (Figure 3). Based on these results, we propose new maximally
exposed geometries for each amino acid (Table S1).
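
The binning and filtering step can be sketched as follows. The function names and the input layout (a list of `(phi, psi, sa)` tuples) are assumptions made for illustration; the 5° bin width and the minimum of five empirical observations are taken from the text.

```python
from collections import defaultdict


def bin_index(angle, width=5):
    """Lower edge of the bin containing `angle` (degrees) on [-180, 180)."""
    return int(((angle + 180.0) % 360.0) // width) * width - 180


def max_sa_by_bin(observations, width=5):
    """observations: iterable of (phi, psi, sa).

    Returns two dicts keyed by (phi_bin, psi_bin): the maximum SA seen in
    each bin and the number of observations that fell into it.
    """
    max_sa = {}
    counts = defaultdict(int)
    for phi, psi, sa in observations:
        key = (bin_index(phi, width), bin_index(psi, width))
        counts[key] += 1
        if key not in max_sa or sa > max_sa[key]:
            max_sa[key] = sa
    return max_sa, dict(counts)


def filter_bins(theoretical, empirical, empirical_counts, min_count=5):
    """Keep, in both data sets, only bins with enough empirical observations."""
    keep = {k for k, n in empirical_counts.items() if n >= min_count}
    return ({k: v for k, v in theoretical.items() if k in keep},
            {k: v for k, v in empirical.items() if k in keep})
```

Applying `max_sa_by_bin` to both the enumerated tripeptide conformations and the crystal-structure residues gives two maps on the same grid, which can then be compared bin by bin.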

We further evaluated our model's performance by directly comparing
theoretical and empirical maximum SA values in each (φ, ψ)
bin. We calculated the difference between these two values for each
5° × 5° bin (now including all bins with at least one observation in the
empirical data set). We then plotted this difference against the
number of empirical observations obtained for each bin (Figure 4). We
found that with increasing amounts of empirical data, this difference
approached zero; the maximum SA values from both approaches converged
as more data became available. Moreover, at least some bins showed a
difference near zero regardless of the number of observations they
contained, even when sparsely populated. Therefore, while our
results did improve with increasing amounts of data, they were also
largely robust to smaller data sets. The overall maximum SA values
for each amino acid as derived from both approaches are summarized
in Table 1. These maximum values were again derived excluding bins
with fewer than five empirical observations.

Application to Empirically Derived Hydrophobicity Scales. The
solvent exposure of an amino acid, averaged over many occurrences
of that amino acid in many different protein structures, should re- 
flect the amino acid's hydrophobicity. Therefore, solvent exposure 
has long been used as a means to empirically derive hydrophobicity 
scales from protein crystal structures [1, 2]. In particular, Rose et
al. [2] derived a hydrophobicity scale by calculating the mean RSA 
for each amino acid across a set of reference crystal structures, us- 
ing the SA normalization values derived in the same work [2]. Since 
those normalization values are inaccurate, as shown above, we as- 
sessed how using our normalization values would alter the Rose hy- 
drophobicity scale. 

We first compared the Rose scale to a number of experimentally 
derived scales (Table 2). We included in the list of experimental 
scales the scale by Kyte & Doolittle [22], which is a hybrid scale 
partially based on solvent-accessibility data from protein structures,
and the scale by MacCallum et al. [23], which is based on
molecular-dynamics simulations. A brief description of each scale is given in
the legend to Table 2. The Rose scale correlated reasonably well 
(50%-70% of variance explained) with most experimental scales. It 
correlated the highest with the scale of Fauchere & Pliska [24] (82% 
of variance explained) and it did not correlate significantly with the 
scale of Wimley et al. [25]. 
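
For reference, the relationship between a correlation coefficient and the percent variance explained quoted above can be sketched in Python. The function names are illustrative; Table 2 reports absolute values of r, and the percentages in the text correspond to r squared.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def percent_variance_explained(r):
    """Percent of variance explained by a linear relationship, 100 * r^2."""
    return 100.0 * r * r
```

For example, the correlation of 0.904 between the Rose scale and the scale of Fauchere & Pliska (Table 2) gives `percent_variance_explained(0.904)` of about 81.7, which is the 82% quoted in the text.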

We next derived two scales based on mean RSA, calculated us- 
ing either our theoretical or our empirical SA normalization values 
(Table S2). Both of our mean RSA scales correlated well with the 
Rose scale (r = 0.96 and r = 0.97, respectively, with P < 10~^^ in 
both cases) but were not identical to it. The biggest difference arose 
for histidine, which is ranked as the 8th-most hydrophobic amino acid 
according to the Rose scale but as the 13th- or 12th-most hydrophobic 
amino acid, respectively, according to our scales. Our scales corre- 
lated more strongly than the Rose scale with all experimental scales 
considered (Table 2). For the majority of experimental scales, the 
percent variance explained increased by approximately 10 percentage
points using our normalization over the Rose normalization. We
can conclude from these results that mean RSA is a useful measure of 
amino acid hydrophobicity and that correct SA normalization is re- 
quired to assign appropriate hydrophobicity scores to all amino acids. 

One concern with using mean RSA as a measure of hydropho- 
bicity is that the RSA distribution of individual amino acids tends to 
be highly skewed (see Figure S1 for an example). Hence, mean RSA
may not accurately reflect the most common RSA values. It might be 
preferable to use instead the fraction of times an amino acid occurs in 
a buried conformation in empirical protein structures. This approach 
was originally suggested by Chothia et al. in 1976 and executed with 
the limited data available at the time [1]. 

We calculated two additional scales from our data set of 2908 
protein structures: for each of the 20 amino acids, we calculated the 
fraction of completely buried residues (100% buried, RSA = 0) and 
the fraction of 95% buried residues (RSA < 0.05) among all occur- 
rences of these amino acids in the protein structures. For most of 
the experimental scales, these two scales showed a stronger corre- 
lation than any of the scales based on mean RSA did (Table 2). The 
two main exceptions were the scale by Fauchere & Pliska [24], which 
correlated better with mean RSA, and the scale by Wimley et al. [25], 
which correlated poorly with all empirical scales. Since the Kyte & 
Doolittle scale [22] is partly based on the fraction of buried residues,
its strong correlation with our scales is not surprising and does not 
represent a truly independent validation of these scales. 



Discussion 

We have derived significantly improved SA normalization values. 
Our normalization values provide a tight upper bound to the largest 
observed SA values in empirical structures. By contrast, previously 
published SA normalization values were too small, by up to 20%,
and frequently led to RSA values > 1. We estimated the maximum 
allowed SA for each amino acid by computationally modeling Gly- 
X-Gly tripeptides, where X is the amino acid of interest, and exhaus- 
tively surveying SA over all biophysically feasible conformations. 
We found that maximally exposed conformations tend to fall into the 
α-helix region of Ramachandran plots, and that extended conformations
display some side-chain burial. The results of our modeling
approach were consistent with maximum SA values found by survey- 
ing nearly 3000 empirical protein crystal structures. We also revisited 
the problem of deriving empirical hydrophobicity scales from protein 
structures. We found that improved SA normalization values lead to 
improved empirical hydrophobicity scales. Further, scales based on 
both mean RSA and on the fraction of buried residues correlated well 
with experimentally measured scales. Overall, the fraction of 95% 
buried residues seems to be the best-performing empirical hydropho- 
bicity scale, but mean RSA correlates well with an experimental scale 
based on side-chain transfer between octanol and water. 

Our method of obtaining SA normalization values was similar 
to the methods employed by Rose et al. [2] and by Miller et al. [3]. 
Rose et al. [2] calculated their SA normalization values by computing
the SA of residue X in Gly-X-Gly tripeptides whose conformations
were chosen based on the average dihedral angles from the empirical
data available at the time. Miller et al. [3], on the other hand, calculated
their SA normalization values by computing the SA of an extended
trimer structure with φ = -120°, ψ = 140° and with side-chain
conformations that were frequently observed in the empirical data. The
key distinction between these previous approaches and ours lies in
our exhaustive sampling of tripeptide conformations. By modeling
all biophysically feasible discrete combinations of φ and ψ angles
and varying rotamers, we identified the conformations that
yield the maximum allowed SA. To pursue our modeling strategy, we
developed a program that allowed us to easily construct peptide chains
from scratch in arbitrary conformations (see Supporting Text for
details).

Our theoretical modeling approach to exhaustively survey tripep- 
tides has two potential shortcomings. First, for bond lengths and an- 
gles (except major dihedral angles), we used mean values observed in 
a large number of protein crystal structures. This approach neglects 
the variation around the mean, and there could be rare cases where 
unusually large bond lengths or unusual bond angles might cause SA 
to become larger than estimated here. Such scenarios would have to 
be exceedingly rare, however, since we did not find a single case in 
which the largest empirically derived maximum SA value exceeded 
the largest theoretically derived maximum SA value (Table 1). Sec- 
ond, for amino acids with more than 10 distinct rotamer conforma- 
tions, we did not exhaustively enumerate all possible conformations 
but only sampled 10 conformations at random. Thus, in principle 
it is possible that we missed a particular rotamer conformation that 
would have corresponded to a larger SA value than the maximum we 
observed. Two arguments suggest that this issue is not likely a major 
source of error. First, again, we did not find a single case in which 
the empirical maximum SA was larger than the theoretical maximum 
SA. Second, maximum SA varied slowly with φ and ψ, and by
exhaustively enumerating conformations in 1° steps, in effect we
sampled the most exposed conformations multiple times, thus reducing
the chance of missing a rare, large-SA conformation.

As RSA calculations are based on the Gly-X-Gly framework, we 
excluded all chain-terminating residues from both the empirical and
the theoretical analysis. Even with our improved SA normalization 
values, then, chain-terminating residues may still display RSA > 1. 
We therefore recommend that future analyses making use of RSA 
similarly exclude any chain-terminating residues, as their RSA esti- 
mates will not be precise. Suitable normalization values for chain- 
terminating residues are not available at present. 

The comparison between experimentally and empirically derived 
hydrophobicity scales has been a persistent topic in biochemistry.
Resolving discrepancies between empirically derived data and
experimentally derived thermodynamics of hydrophobicity could provide
crucial insight into algorithms of protein-structure prediction and
de novo protein folding. Wolfenden et al. [26] were the first to propose
an approach for reconciling both approaches by correlating the dis- 
tribution of amino acid exposure with their experimental behaviors in 
water/vapor solutions. More recently, Moelbert et al. [4] attempted 
to reconcile these disparities by correlating hydrophobic states with 
surface-exposure patterns of protein structures. Additionally, Shay- 
tan et al. [5] assessed the distribution of amino acid exposure in 
proteins to discern apparent free energies of transfer between protein 
interior and surface states, and found that free energy is highly cor- 
related with experimental hydrophobicity scales [5]. Each of these 
approaches used the SA normalization values from either Rose et 
al. [2] or Miller et al. [3]. Since the normalization SA values devel- 
oped here are more accurate, we believe that our findings are valu- 
able for determining exposure states. Using the Rose hydrophobicity 
scale as an example, we have shown here that improved SA normal- 
ization values consistently yield improved correlations with exper- 
imental scales, irrespective of the exact type of experimental scale 
considered. Of all empirical scales we analyzed, however, the frac- 
tion of 95% buried residues was most consistently strongly correlated 
with different experimental scales and thus could be considered the 
overall best-performing empirical scale. 

Further, in agreement with Shaytan et al. [5], we found that 
different experimental scales corresponded to different empirical
scales. For example, transfer energies from water to vapor correlated 
the strongest with the fraction of 100% buried residues, while trans- 
fer energies from water to cyclohexane correlated the strongest with 
the fraction of 95% buried residues, and transfer energies from wa- 
ter to octanol correlated the strongest with mean RSA. Since mean 
RSA puts more weight on exposed residues than does the fraction of 
either 100% buried or 95% buried residues, this finding agrees with 
the three distinct types of scales found by Shaytan et al. [5]. The 
pentapeptide scale by Wimley et al. [25], however, did not correlate
well with any of the empirical scales we considered. Wimley et
al. performed a partitioning experiment between water and 1-octanol
using pentapeptide species, Ace-WLXLL, with X being one of the
20 naturally occurring amino acids. Otherwise, their setup was
similar to that of Fauchere & Pliska [24]. Because it is based on
pentapeptides rather than individual amino acids, the Wimley et al.
hydrophobicity scale does not seem to accurately reflect the hydrophobic
character of individual amino acids but rather that of the pentapeptides.

In summary, we have presented significantly improved SA nor- 
malization values. We recommend that our theoretical normalization 
values (column 1 of Table 1) be used to normalize SA. The optimal 
hydrophobicity scale will depend on the specific application, but the 
fraction of 95% buried residues seems to be the best general-purpose 
empirical scale. 



Materials and Methods 

We obtained a set of 2908 high-quality protein crystal structures us- 
ing the PISCES server [21]. We imposed the following requirements: 
resolution of 1.8 Å or less, an R-free value < 0.25, and a pairwise
mutual sequence identity of at most 20%. For each amino-acid residue
in all 2908 structures, we retrieved bond lengths, bond angles,
dihedral angles, peptide bond lengths, and nearest neighbors.
Chain-terminating residues, defined as those residues whose peptide
bond length to any neighboring residue deviated by more than six
standard deviations from the protein's mean peptide bond length, were
excluded from all subsequent analyses.
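
The six-standard-deviation criterion can be illustrated with a short Python sketch. It assumes that the peptide bond lengths between consecutive residues of one protein are already available as a list, uses the population standard deviation, and treats deviations in either direction as outliers; these are assumptions of the sketch, as is the function name. How the first and last residues in a file are handled is left to the caller.

```python
import statistics


def chain_terminating_residues(bond_lengths, n_sd=6.0):
    """Indices of residues flanking an outlier peptide bond.

    bond_lengths[i] is the peptide bond length (in Å) between residue i and
    residue i + 1, taken in file order. A bond is an outlier if it deviates
    from the protein's mean by more than `n_sd` standard deviations.
    """
    mean = statistics.fmean(bond_lengths)
    sd = statistics.pstdev(bond_lengths)
    flagged = set()
    if sd == 0:
        return flagged  # all bonds identical, nothing to flag
    for i, length in enumerate(bond_lengths):
        if abs(length - mean) > n_sd * sd:
            flagged.update((i, i + 1))  # both partners of the broken bond
    return flagged
```

Because a single outlier among n bonds can deviate by at most sqrt(n - 1) population standard deviations, a threshold of six can only be exceeded in proteins with more than 37 peptide bonds.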

We used the program DSSP (2011 version) [27] to calculate 
solvent accessibility (SA) and to identify the secondary structure 
of each residue across all proteins. We calculated RSA as RSA = 
SA/Maximum SA, where "Maximum SA" corresponds to the max- 
imum SA value, as determined by the normalization scale used, for 
the focal amino acid. 
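
As a minimal illustration of this normalization, the following Python sketch computes RSA from the theoretical values of Table 1. The dictionary name, the three-letter residue keys, and the function name are illustrative choices and are not part of DSSP or of the software used in this work.

```python
# Theoretical maximum SA values in Å², copied from Table 1.
MAX_SA_THEORETICAL = {
    "ALA": 129.0, "ARG": 274.0, "ASN": 195.0, "ASP": 193.0, "CYS": 158.0,
    "GLU": 223.0, "GLN": 224.0, "GLY": 104.0, "HIS": 209.0, "ILE": 197.0,
    "LEU": 201.0, "LYS": 237.0, "MET": 218.0, "PHE": 239.0, "PRO": 159.0,
    "SER": 151.0, "THR": 172.0, "TRP": 282.0, "TYR": 263.0, "VAL": 174.0,
}


def relative_solvent_accessibility(residue, sa, max_sa=MAX_SA_THEORETICAL):
    """Return RSA = SA / Maximum SA for a three-letter residue code."""
    return sa / max_sa[residue.upper()]


# An alanine with SA = 120 Å² has RSA = 120 / 129 (about 0.93) on this scale,
# whereas the Miller et al. value of 113 Å² would give an RSA above 1.
print(round(relative_solvent_accessibility("ALA", 120.0), 2))
```

Passing a different dictionary as `max_sa` allows the same SA values to be normalized with any of the other scales in Table 1.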

We developed a novel procedure to computationally build Gly- 
X-Gly tripeptide structures from scratch. This method is described in 
detail in Supporting Text. Briefly, we first constructed peptides in a 
defined conformation by placing each atom at the correct position in 
3D space. We then adjusted the φ, ψ, and χ angles to obtain the desired
conformation. 
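
Adjusting a dihedral angle amounts to rotating all atoms downstream of a bond about that bond's axis. The Python sketch below shows one generic way to do this with a Rodrigues rotation; it is not the procedure described in Supporting Text, and all function names are illustrative. Atom positions are assumed to be plain (x, y, z) tuples.

```python
import math


def _sub(a, b): return (a[0] - b[0], a[1] - b[1], a[2] - b[2])
def _add(a, b): return (a[0] + b[0], a[1] + b[1], a[2] + b[2])
def _scale(a, s): return (a[0] * s, a[1] * s, a[2] * s)
def _dot(a, b): return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _unit(a):
    return _scale(a, 1.0 / math.sqrt(_dot(a, a)))


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) defined by four atom positions."""
    b0 = _sub(p0, p1)
    b1 = _unit(_sub(p2, p1))
    b2 = _sub(p3, p2)
    # Project b0 and b2 onto the plane perpendicular to the central bond.
    v = _sub(b0, _scale(b1, _dot(b0, b1)))
    w = _sub(b2, _scale(b1, _dot(b2, b1)))
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))


def rotate_about_axis(point, a, b, angle_deg):
    """Rotate `point` by `angle_deg` about the axis running from a to b."""
    k = _unit(_sub(b, a))
    v = _sub(point, a)
    t = math.radians(angle_deg)
    c, s = math.cos(t), math.sin(t)
    rotated = _add(_add(_scale(v, c), _scale(_cross(k, v), s)),
                   _scale(k, _dot(k, v) * (1.0 - c)))
    return _add(a, rotated)


def set_dihedral(p0, p1, p2, downstream, target_deg):
    """Rotate all `downstream` atoms about the p1-p2 bond so that the
    dihedral p0-p1-p2-downstream[0] equals `target_deg`."""
    delta = target_deg - dihedral(p0, p1, p2, downstream[0])
    return [rotate_about_axis(p, p1, p2, delta) for p in downstream]
```

Because the rotation is rigid, bond lengths and bond angles among the downstream atoms are unchanged; only the chosen dihedral is altered.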



We calculated empirical hydrophobicity scales on the same set 
of 2908 crystal structures. Mean RSA of each amino acid was 
calculated as the RSA averaged over all occurrences of that amino 
acid in the data set. The corresponding hydrophobicity scale was
defined as 1 - (mean RSA). Fraction 100% buried was calculated for
each amino acid as the percentage of occurrences of that amino acid in
the data set for which the program DSSP reported SA < 1 Å². Fraction
95% buried was calculated for each amino acid as the percentage of
occurrences for which that amino acid had an RSA value < 0.05, where
RSA was calculated using the theoretical normalization values of Table 1.
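
The three empirical scales defined above can be sketched in Python as follows. The input layout (an iterable of `(residue, sa)` pairs plus a dictionary of normalization values) and all names are assumptions made for illustration; the thresholds of 1 Å² and 0.05 are those given in the text. The sketch returns fractions between 0 and 1 rather than percentages.

```python
def empirical_scales(records, max_sa, buried_sa=1.0, buried_rsa=0.05):
    """Compute per-residue empirical hydrophobicity scales.

    records: iterable of (residue, sa) pairs, one per residue occurrence
    max_sa:  dict mapping residue -> normalization SA value (Å²)

    Returns {residue: {"hydrophobicity": 1 - mean RSA,
                       "frac_100_buried": fraction with SA < buried_sa,
                       "frac_95_buried": fraction with RSA < buried_rsa}}.
    """
    totals = {}
    for residue, sa in records:
        rsa = sa / max_sa[residue]
        t = totals.setdefault(
            residue, {"n": 0, "rsa_sum": 0.0, "n100": 0, "n95": 0})
        t["n"] += 1
        t["rsa_sum"] += rsa
        if sa < buried_sa:
            t["n100"] += 1
        if rsa < buried_rsa:
            t["n95"] += 1

    return {
        residue: {
            "hydrophobicity": 1.0 - t["rsa_sum"] / t["n"],
            "frac_100_buried": t["n100"] / t["n"],
            "frac_95_buried": t["n95"] / t["n"],
        }
        for residue, t in totals.items()
    }
```

Note that every residue counted as 100% buried is also counted as 95% buried, so the second fraction is never smaller than the first.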

Author Contributions 

M.A.T., A.G.M., and C.O.W. designed the research, M.A.T. and 
C.O.W. performed the research and analyzed the data, A.G.M. con- 
tributed new reagents or analytic tools, M.A.T., S.J.S., and C.O.W. 
wrote the paper. 

ACKNOWLEDGMENTS. This work was supported by NIH grant R01 GM088344 
to C.O.W. We thank Dariya Sydykova for help with the python code used to build 
model tripeptides. 



1. Chothia C (1976) The nature of the accessible and buried surfaces in proteins. J 
Mol Biol :1-14.

2. Rose GD, Geselowitz AR, Lesser GJ, Lee RH, Zehfus MH (1985) Hydrophobicity of 
amino acid residues in globular proteins. Science 229:834-838. 

3. Miller S, Janin J, Lesk AM, Chothia C (1987) Interior and surface of monomeric 
proteins. J Mol Biol 196:641-656. 

4. Moelbert S, Emberly E, Tang C (2004) Correlation between sequence hydrophobicity 
and surface-exposure pattern of database proteins. Prot Sci :752-762. 

5. Shaytan AK, Shaitan KV, Khokhlov AR (2009) Solvent accessible surface area of 
amino acid residues in globular proteins: Correlations of apparent transfer free 
energies with experimental hydrophobicity scales. Biomacromolecules
10:1224-1237.

6. Goldman N, Thorne JL, Jones DT (1998) Assessing the impact of secondary struc- 
ture and solvent accessibility on protein evolution. Genetics 149:445-458. 

7. Bloom JD, Drummond DA, Arnold FH, Wilke CO (2006) Structural determinants of 
the rate of protein evolution in yeast. Mol Biol Evol 23:1751-1761.

8. Franzosa EA, Xia Y (2009) Structural determinants of protein evolution are context- 
sensitive at the residue level. Mol Biol Evol 26:2387-2395. 

9. Zhou T, Weems M, Wilke CO (2009) Translationally optimal codons associate with 
structurally sensitive sites in proteins. Mol Biol Evol 26:1571-1580. 

10. Franzosa EA, Xia Y (2012) Independent effects of protein core size and expression 
on structure-evolution relationships at the residue level. PLoS One 7:e46602. 

11. Scherrer MP, Meyer AG, Wilke CO (2012) Modeling coding-sequence evolution 
within the context of residue solvent accessibility. BMC Evol Biol 12:179. 

12. Meyer AG, Wilke CO (2012) Integrating sequence variation and protein structure to 
identify sites under selection. Mol Biol Evol In press. 

13. Conant GC, Stadler PF (2009) Solvent exposure imparts similar selective pressures
across a range of yeast proteins. Mol Biol Evol 26:1155-1161.

14. Rost B, Sander C (1994) Conservation and prediction of solvent accessibility in 
protein families. Proteins 20:216-226. 

15. Pollastri G, Baldi P, Fariselli P, Casadio R (2002) Prediction of coordination number 
and relative solvent accessibility in proteins. Proteins 47:142-153. 

16. Kim H, Park H (2004) Prediction of protein relative solvent accessibility with support
vector machines and long-range interaction 3D local descriptor. Proteins 54:557- 
562. 



17. Nguyen MN, Rajapakse JC (2005) Prediction of protein relative solvent accessibility 
with a two-stage SVM approach. Proteins 59:30-37. 

18. Levy ED (2010) A simple definition of structural regions in proteins and its use in 
analyzing interface evolution. J Mol Biol 403:660-670. 

19. Cheng J, Sweredoski MJ, Baldi P (2006) DOMpro: protein domain prediction using 
profiles, secondary structure, relative solvent accessibility, and recursive neural 
networks. Data Mining and Knowledge Discovery 13:1-10. 

20. Chen H, Zhou HX (2005) Prediction of solvent accessibility and sites of deleterious 
mutations from protein sequence. Nucl Acids Res 33:3193-3199.

21. Wang G, Dunbrack RL (2003) PISCES: a protein sequence culling server.
Bioinformatics 19:1589-1591.

22. Kyte J, Doolittle RF (1982) A simple method for displaying the hydropathic character
of a protein. J Mol Biol 157:105-132. 

23. MacCallum JL, Bennett WFD, Tieleman DP (2007) Partitioning of amino acid side 
chains into lipid bilayers: results from computer simulations and comparison to 
experiment. J Gen Physiol 129:371-377. 

24. Fauchere JL, Pliska VE (1983) Hydrophobic parameters of amino-acid side chains 
from the partitioning of N-acetyl-amino-acid amides. Eur J Med Chem 18:369-375. 

25. Wimley WC, Creamer TP, White SH (1996) Solvation energies of amino acid side 
chains and backbone in a family of host-guest pentapeptides. Biochem 35:5109- 
5124. 

26. Wolfenden R, Anderson L, Cullis PM, Southgate CCB (1981) Affinities of amino acid 
side chains for solvent water. Biochemistry :849-855. 

27. Kabsch W, Sander C (1983) Dictionary of protein secondary structure: pattern 
recognition of hydrogen-bonded and geometrical features. Biopolymers 22:2577- 
2637. 

28. Radzicka A, Wolfenden R (1988) Comparing the polarities of the amino acids:
side-chain distribution coefficients between the vapor phase, cyclohexane, 1-octanol,
and neutral aqueous solution. Biochem 27:1664-1670.

29. Moon CP, Fleming KG (2011) Side chain hydrophobicity scales derived from
transmembrane protein folding into lipid bilayers. Proc Natl Acad Sci USA
108:10174-10177.






Table 1. Proposed values for SA normalization (in Å²), compared to previously used scales defined by Rose et al. [2] and Miller et al. [3].

Residue        Theoretical  Empirical  Miller et al. (1987)  Rose et al. (1985)
Alanine              129.0      121.0                 113.0               118.1
Arginine             274.0      265.0                 241.0               256.0
Asparagine           195.0      187.0                 158.0               165.5
Aspartate            193.0      184.0                 151.0               158.7
Cysteine             158.0      140.0                 140.0               146.1
Glutamate            223.0      215.0                 183.0               186.2
Glutamine            224.0      215.0                 189.0               193.2
Glycine              104.0       98.0                  85.0                88.1
Histidine            209.0      208.0                 194.0               202.5
Isoleucine           197.0      195.0                 182.0               181.0
Leucine              201.0      192.0                 180.0               193.1
Lysine               237.0      227.0                 211.0               225.8
Methionine           218.0      203.0                 204.0               203.4
Phenylalanine        239.0      228.0                 218.0               222.8
Proline              159.0      159.0                 143.0               146.8
Serine               151.0      143.0                 122.0               129.8
Threonine            172.0      161.0                 146.0               152.5
Tryptophan           282.0      265.0                 259.0               266.3
Tyrosine             263.0      255.0                 229.0               236.8
Valine               174.0      165.0                 160.0               164.5


Table 2. Absolute value of correlation coefficients r between empirically derived and experimentally derived hydrophobicity scales. The largest correlation in each row is highlighted in bold.

                          Empirical scale
Experimental scale        Mean RSA (Rose)^a   Mean RSA (theor)^b   Mean RSA (emp)^c   100% buried^d   95% buried^e
Wolfenden et al.^f        0.614               0.699                0.687              0.830           0.786
Kyte & Doolittle^g        0.841               0.890                0.886              0.956           0.956
Radzicka & Wolfenden^h    0.804               0.860                0.857              0.848           0.896
Moon & Fleming^i          0.704               0.768                0.761              0.684           0.777
MacCallum et al.^j        0.804               0.860                0.857              0.848           0.898
Fauchere & Pliska^k       0.904               0.911                0.910              0.734           0.882
Wimley et al.^l           0.430^m             0.475                0.480              0.325^m         0.425^m

^a Mean RSA of residues in protein structures, as calculated by Rose et al. [2].
^b Mean RSA of residues in protein structures, as given in column 2 of Table S2.
^c Mean RSA of residues in protein structures, as given in column 3 of Table S2.
^d Fraction of 100% buried residues, as given in column 4 of Table S2.
^e Fraction of 95% buried residues, as given in column 5 of Table S2.
^f Transfer energy from vapor to water [26].
^g Hybrid scale based on transfer energy from vapor to water and on the percentages of 95% and 100% buried residues in protein structures [22].
^h Transfer energy from cyclohexane to water [28].
^i ΔΔG between the folded and unfolded state of a mutated membrane-inserted protein, outer membrane phospholipase A [29].
^j Transfer energy calculated from molecular-dynamics simulations of side-chain analogs within a bilayer [23].
^k Transfer energy between octanol and water [24].
^l Transfer energy of pentapeptides between octanol and water [25].
^m Correlation not statistically significant; all other correlations are significant at α = 0.05.
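The entries of Table 2 are absolute values of correlation coefficients. The sketch below shows one way such a value could be computed, assuming that r denotes the Pearson correlation coefficient (the table caption does not say which coefficient was used). The function name `abs_pearson` and the toy numbers are hypothetical and serve only to exercise the function.

```python
import math


def abs_pearson(x, y):
    """Absolute value of the Pearson correlation of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return abs(sxy / math.sqrt(sxx * syy))


# Toy data (made up): the second scale decreases as the first increases,
# so the signed correlation is -1 and its absolute value is 1.
burial_scale = [1.0, 2.0, 3.0, 4.0]
transfer_energy = [7.0, 5.0, 3.0, 1.0]
print(abs_pearson(burial_scale, transfer_energy))
```

Taking the absolute value makes scales comparable even when they use opposite sign conventions, for example when one scale assigns large values to hydrophobic residues and another assigns them small or negative values.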

Table S1. Backbone conformation of maximally exposed trimer structures. Multiple rows per residue indicate alternative conformations with comparable solvent exposure.

                 Empirical               Theoretical
Residue          φ           ψ           φ                ψ
Alanine          -51.9°      -37.9°      -60°             -15°
                 -66.7°      -13.1°      -45°             -40°
                 -52.1°      -33.6°      70°              -5°
                 -58.6°      -30.7°
Arginine         -79.2°      -20.2°      -70°             -25°
                                         -70°             -5°
                                         -60°             -15°
                                         -55°             -30°
                                         -40°             -50°
Asparagine       -74.1°      -3.6°       -50°             -40°
                 -94.8°      -3.4°       -50°             -30°
                 -67.4°      -16.8°      70°              -5°
Aspartate        -73.0°      -11.0°      70°              -5°
Cysteine         -61.4°      -23.4°      -60°             -20°
Glutamine        -66.8°      -20.3°      -60°             -15°
                                         70°              -5°
Glutamate        -60.2°      -39.6°      -60° to -55°     -25° to -20°
                                         60°              15°
                                         65°              5°
Glycine          76.9°       0.7°        -65° to -60°     -15° to -10°
Histidine        -93.7°      12.9°       -80° to -75°     130°
                                         -80°             170°
Isoleucine       -64.1°      -21.9°      -55°             -25°
                                         -50°             -40°
Leucine          -58.1°      -24.8°      -70°             -5°
                                         55°              15°
Lysine           -62.1°      -24.4°      -60°             -15°
Methionine       -67.5°      -27.5°      -60°             -20°
                                         -55°             -30°
Phenylalanine    -50.3°      135.2°      -45°             -45°
Proline          -38.3°      -57.8°      -55°             -20°
                                         -40°             -50°
Serine           -58.1°      -27.3°      -60° to -55°     -25° to -15°
                 -57.5°      -36.1°      65°              5°
                 -103.4°     1.1°        70°              -5°
Threonine        -65.0°      -27.9°      -60°             -15°
                                         -45°             -45°
Tryptophan       -67.6°      -9.8°       -75°             -15°
                                         -75°             0°
                                         -60° to -55°     -25° to -20°
                                         -50°             -40° to -35°
Tyrosine         -61.7°      -32.4°      -50°             -40° to -35°
Valine           -48.9°      -28.0°      -55°             -25°

Table S2. Hydrophobicity scales derived in this work.

Amino Acid       Mean RSA (theor)^a   Mean RSA (emp)^b   100% buried^c   95% buried^d
Alanine          0.796                0.782              0.223           0.394
Arginine         0.646                0.634              0.0109          0.0694
Asparagine       0.668                0.654              0.0445          0.142
Aspartate        0.641                0.623              0.0274          0.102
Cysteine         0.901                0.888              0.273           0.539
Glutamate        0.597                0.582              0.0169          0.0663
Glutamine        0.648                0.633              0.0264          0.104
Glycine          0.745                0.729              0.162           0.286
Histidine        0.719                0.718              0.0509          0.186
Isoleucine       0.874                0.873              0.239           0.509
Leucine          0.860                0.853              0.207           0.478
Lysine           0.564                0.545              0.00596         0.0275
Methionine       0.844                0.833              0.211           0.45
Phenylalanine    0.869                0.862              0.185           0.477
Proline          0.670                0.650              0.0583          0.16
Serine           0.730                0.715              0.0989          0.23
Threonine        0.742                0.724              0.0984          0.237
Tryptophan       0.844                0.834              0.0937          0.358
Tyrosine         0.817                0.811              0.08            0.303
Valine           0.863                0.856              0.247           0.488

^a Scale based on mean RSA, as calculated using the theoretically derived SA normalization values. The actual scale is defined as 1 - (mean RSA), so that more hydrophobic residues receive larger values.
^b Same scale as in (a), but calculated using the empirically derived SA normalization values.
^c Fraction of 100% buried residues, with RSA = 0 (corresponding to SA < 1 Å²).
^d Fraction of 95% buried residues, with RSA < 0.05.
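The three kinds of scale in Table S2 can be sketched for a single residue type as follows. This is a minimal illustration under stated assumptions: the function name `hydrophobicity_scales` is hypothetical, the five SA values are made up, and the 100% buried criterion is implemented through the SA < 1 Å² threshold given in the table footnote.

```python
def hydrophobicity_scales(sa_values, max_sa):
    """Return the three burial-based scales for one residue type.

    sa_values: observed SA values (Å^2) of all residues of this type.
    max_sa:    normalization value (Å^2) for this residue type.
    """
    n = len(sa_values)
    rsa_values = [sa / max_sa for sa in sa_values]
    mean_rsa = sum(rsa_values) / n
    return {
        # 1 - (mean RSA): larger values for more buried (hydrophobic) residues.
        "one_minus_mean_rsa": 1.0 - mean_rsa,
        # Fraction of residues counted as 100% buried (SA < 1 Å^2).
        "frac_100_buried": sum(1 for sa in sa_values if sa < 1.0) / n,
        # Fraction of residues counted as 95% buried (RSA < 0.05).
        "frac_95_buried": sum(1 for rsa in rsa_values if rsa < 0.05) / n,
    }


# Made-up SA values for five alanine residues, normalized with the
# theoretical maximum SA of alanine from Table 1 (129.0 Å^2).
scales = hydrophobicity_scales([0.0, 0.5, 3.0, 40.0, 90.0], max_sa=129.0)
for name, value in scales.items():
    print(f"{name}: {value:.3f}")
```

Only the mean-RSA scale depends strongly on the normalization value; the 100% buried fraction is defined directly on SA, which may be one reason to prefer burial fractions when the normalization is uncertain.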

Fig. 1. Frequency of residues with RSA > 1 in empirical protein structures. Nearly all
amino acids, and notably R, D, K, G, and P, show RSA > 1 when RSA is calculated using
the normalization values of either Rose et al. [2] or Miller et al. [3].

Fig. 2. Ramachandran plot for alanine residues in our empirical data set. Coordinates
that correspond to RSA values > 1 are shown in black and are clearly concentrated
around coordinates (-50°, -45°). We therefore propose that this region contains the
maximally exposed conformation of alanine and should be used for calculating maximum
SA.

Fig. 3. Ramachandran plots for empirical and theoretical maximum SA values of alanine.
(A) Empirical maximum SA values for each 5° by 5° bin. Only bins that contained at
least five observations are shown. (B) Theoretical maximum SA values, as determined
by computational modeling, shown for non-empty bins in (A). Both the empirical and the
theoretical approach find the largest SA values in the α-helix region around (-50°, -45°).
By contrast, the extended conformation (-120°, 140°) leads to relatively low maximum
SA.
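The binning described for Fig. 3A (maximum SA per 5° by 5° bin, keeping only bins with at least five observations) could be sketched as below. The function name `max_sa_per_bin`, the choice to label each bin by its lower-left corner, and all observation values are assumptions for illustration; they are not taken from the data set used in this work.

```python
import math
from collections import defaultdict


def max_sa_per_bin(observations, bin_width=5.0, min_count=5):
    """Group (phi, psi, sa) observations into square bins of the Ramachandran
    plane and return the maximum SA of every bin with >= min_count entries.

    Bins are keyed by their lower-left corner in degrees.
    """
    bins = defaultdict(list)
    for phi, psi, sa in observations:
        key = (math.floor(phi / bin_width) * bin_width,
               math.floor(psi / bin_width) * bin_width)
        bins[key].append(sa)
    return {key: max(values) for key, values in bins.items()
            if len(values) >= min_count}


# Made-up observations: five in the bin starting at (-55°, -40°) and two in
# the bin starting at (-120°, 140°); the second bin is too sparse to report.
observations = [
    (-51.9, -37.9, 110.0), (-52.1, -36.0, 121.0), (-53.0, -38.0, 100.0),
    (-54.9, -39.9, 95.0), (-50.1, -35.1, 105.0),
    (-120.0, 140.0, 90.0), (-119.0, 141.0, 92.0),
]
print(max_sa_per_bin(observations))
```

Dropping sparsely populated bins guards against reporting a maximum that reflects only one or two observations.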

Fig. 4. Difference between theoretically and empirically determined maximum SA values
for alanine, across 5° by 5° bins. As the amount of data per bin increases, the difference
between theoretical and empirical maximum SA approaches zero, demonstrating that our
two methods converged with increasing amounts of data. Furthermore, the difference
between values is frequently zero, even when little data is available for a bin. This observation
indicates that our theoretically derived maximum SA values provide a tight bound on the
empirically observed ones.

Figure S1. Distribution of RSA values for alanine. RSA was calculated using our theoretically determined normalization values. This distribution is highly non-normal with a strong right skew. Therefore, mean
RSA is a poor measure of center for this distribution. Similarly skewed distributions are found for most amino acids.
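To illustrate why the mean is a poor measure of center for a right-skewed distribution, the sketch below compares mean and median on a small sample and computes the moment-based skewness. The RSA values are made up to mimic a right skew and are not taken from the alanine distribution shown in the figure; the helper name `skewness` is likewise hypothetical.

```python
import statistics


def skewness(values):
    """Moment-based sample skewness g1 = m3 / m2**1.5.

    Positive values indicate a right-skewed sample.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        raise ValueError("skewness is undefined for a constant sample")
    return m3 / m2 ** 1.5


# Made-up, right-skewed RSA values: many buried residues, few exposed ones.
rsa_values = [0.0, 0.0, 0.02, 0.05, 0.1, 0.15, 0.3, 0.6, 0.9]

print("mean    :", round(statistics.mean(rsa_values), 3))
print("median  :", statistics.median(rsa_values))
print("skewness:", round(skewness(rsa_values), 3))
```

In this toy sample the few highly exposed residues pull the mean well above the median, which is the behavior the caption refers to.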